Abstract

We consider the group lasso penalty for the linear model. We note that 
the standard algorithm for solving the problem assumes that the model 
matrices in each group are orthonormal. Here we consider a more general 
penalty that blends the lasso ($L_1$) with the group lasso ("two-norm"). This
penalty yields solutions that are sparse at both the group and individual 
feature levels. We derive an efficient algorithm for the resulting convex 
problem based on coordinate descent. This algorithm can also be used 
to solve the general form of the group lasso, with non-orthonormal model 
matrices. 



1 Introduction 

In this note, we consider the problem of prediction using a linear model. Our data 
consist of $y$, a vector of $N$ observations, and $X$, an $N \times p$ matrix of features.

Suppose that the $p$ predictors are divided into $L$ groups, with $p_\ell$ the number
in group $\ell$. For ease of notation, we use a matrix $X_\ell$ to represent the predictors
corresponding to the $\ell$th group, with corresponding coefficient vector $\beta_\ell$. Assume
that $y$ and $X$ have been centered, that is, all variables have mean zero.

In an elegant paper, Yuan & Lin (2007) proposed the group lasso, which solves
the convex optimization problem



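$$\min_{\beta \in \mathbb{R}^p} \left( \Big\| y - \sum_{\ell=1}^{L} X_\ell \beta_\ell \Big\|_2^2 + \lambda \sum_{\ell=1}^{L} \sqrt{p_\ell}\, \|\beta_\ell\|_2 \right), \qquad (1)$$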

where the $\sqrt{p_\ell}$ terms account for the varying group sizes, and $\|\cdot\|_2$ is the
Euclidean norm (not squared). This procedure acts like the lasso at the group level:
depending on $\lambda$, an entire group of predictors may drop out of the model. In fact
if the group sizes are all one, it reduces to the lasso. Meier et al. (2008) extend 
the group lasso to logistic regression. 

The group lasso does not, however, yield sparsity within a group. That is, 
if a group of parameters is non-zero, they will all be non-zero. In this note 
we propose a more general penalty that yields sparsity at both the group and 
individual feature levels, in order to select groups and predictors within a group. 
We also point out that the algorithm proposed by Yuan & Lin (2007) for fitting 
the group lasso assumes that the model matrices in each group are orthonormal. 
The algorithm that we provide for our more general criterion also works for the 
standard group lasso with non-orthonormal model matrices. 

We consider the sparse group lasso criterion 
$$\min_{\beta \in \mathbb{R}^p} \left( \Big\| y - \sum_{\ell=1}^{L} X_\ell \beta_\ell \Big\|_2^2 + \lambda_1 \sum_{\ell=1}^{L} \|\beta_\ell\|_2 + \lambda_2 \|\beta\|_1 \right), \qquad (2)$$



where $\beta = (\beta_1, \beta_2, \ldots, \beta_L)$ is the entire parameter vector. For notational simplicity
we omit the weights $\sqrt{p_\ell}$. Expression (2) is the sum of convex functions and is
therefore convex. Figure 1 shows the constraint region for the group lasso, lasso
and sparse group lasso. A similar penalty involving both group lasso and lasso
terms is discussed in Peng et al. (2009). When $\lambda_2 = 0$, criterion (2) reduces to
the group lasso, whose computation we discuss next.
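
As a concrete reading of criterion (2), the following minimal Python sketch evaluates it for a given coefficient vector. The function name and the `groups` encoding (a list of index arrays, one per group) are assumptions made for illustration and do not come from the text.

```python
import numpy as np

def sparse_group_lasso_objective(y, X, groups, beta, lam1, lam2):
    """Evaluate criterion (2): squared error + lam1 * group 2-norms + lam2 * 1-norm.

    `groups` is a hypothetical encoding: a list of index arrays, one per group.
    """
    resid = y - X @ beta
    group_penalty = sum(np.linalg.norm(beta[g]) for g in groups)
    return resid @ resid + lam1 * group_penalty + lam2 * np.abs(beta).sum()

# Tiny illustration: two groups of two predictors, perfect fit, so only penalties remain.
X = np.eye(4)
y = np.array([3.0, 4.0, 0.0, 0.0])
groups = [np.arange(0, 2), np.arange(2, 4)]
beta = np.array([3.0, 4.0, 0.0, 0.0])
value_sparse = sparse_group_lasso_objective(y, X, groups, beta, lam1=1.0, lam2=0.5)
value_group = sparse_group_lasso_objective(y, X, groups, beta, lam1=1.0, lam2=0.0)
```

Setting `lam2=0.0` gives the group lasso criterion without the group-size weights, matching the remark that (2) reduces to the group lasso in that case.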



2 Computation for the group lasso 

Here we briefly review the computation for the group lasso of Yuan & Lin (2007). 
In the process we clarify a confusing issue regarding orthonormality of predictors 
within a group. 

The subgradient equations (see e.g. Bertsekas (1999)) for the group lasso are 

$$-X_\ell^T \Big( y - \sum_{k=1}^{L} X_k \beta_k \Big) + \lambda \cdot s_\ell = 0; \qquad \ell = 1, 2, \ldots, L, \qquad (3)$$

Figure 1: Contour lines for the penalty for the group lasso (dotted), lasso (dashed) and
sparse group lasso penalty (solid), for a single group with two predictors.

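where $s_\ell = \beta_\ell / \|\beta_\ell\|_2$ if $\beta_\ell \neq 0$ and $s_\ell$ is a vector with $\|s_\ell\|_2 \le 1$ otherwise. Let the
solutions be $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_L$. If

$$\Big\| X_\ell^T \Big( y - \sum_{k \neq \ell} X_k \hat\beta_k \Big) \Big\|_2 \le \lambda \qquad (4)$$

then $\hat\beta_\ell$ is zero; otherwise it satisfies

$$\hat\beta_\ell = \Big( X_\ell^T X_\ell + \frac{\lambda}{\|\hat\beta_\ell\|_2} I \Big)^{-1} X_\ell^T r_\ell \qquad (5)$$

where $r_\ell = y - \sum_{k \neq \ell} X_k \hat\beta_k$ is the partial residual for group $\ell$.

Now if we assume that $X_\ell^T X_\ell = I$, and redefine $s_\ell = X_\ell^T r_\ell$, then (5) simplifies to
$\hat\beta_\ell = (1 - \lambda / \|s_\ell\|_2)\, s_\ell$. This leads to an algorithm that cycles through the groups
$\ell$, and is a blockwise coordinate descent procedure. It is given in Yuan & Lin
(2007).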
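
The blockwise procedure for orthonormal groups can be sketched in Python as follows. This is an illustrative reading of conditions (4) and (5), not the code of Yuan & Lin (2007); the function name, the `groups` encoding, the stopping rule and the simulated data are all assumptions.

```python
import numpy as np

def group_lasso_orthonormal(y, X, groups, lam, n_iter=500, tol=1e-10):
    """Blockwise coordinate descent, assuming X[:, g].T @ X[:, g] = I for each group g."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta_old = beta.copy()
        for g in groups:
            # Partial residual r_l: remove the fit of all other groups.
            r = y - X @ beta + X[:, g] @ beta[g]
            s = X[:, g].T @ r
            norm_s = np.linalg.norm(s)
            if norm_s <= lam:
                beta[g] = 0.0                        # condition (4): the group drops out
            else:
                beta[g] = (1.0 - lam / norm_s) * s   # simplified form of (5)
        if np.max(np.abs(beta - beta_old)) < tol:    # hypothetical stopping rule
            break
    return beta

rng = np.random.default_rng(0)
# Illustrative data: orthonormalize each group separately with a QR decomposition.
blocks = [np.linalg.qr(rng.standard_normal((50, 3)))[0] for _ in range(2)]
X = np.hstack(blocks)
groups = [np.arange(0, 3), np.arange(3, 6)]
y = X[:, :3] @ np.array([2.0, -1.0, 1.0]) + 0.1 * rng.standard_normal(50)
lam = 0.5
beta_hat = group_lasso_orthonormal(y, X, groups, lam)
```

Only the columns within each group are orthonormal here; columns in different groups may still be correlated, which is why the update cycles over groups until the coefficients stop changing.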

If however the predictors are not orthonormal, one approach is to orthonormalize
them before applying the group lasso. However, this will not generally provide
a solution to the original problem. In detail, if $X_\ell = U D V^T$, then the columns of
$U = X_\ell V D^{-1}$ are orthonormal. Then $X_\ell \beta_\ell = U D V^T \beta_\ell = U [D V^T \beta_\ell] = U \beta_\ell^*$.
But $\|\beta_\ell^*\| = \|\beta_\ell\|$ for all $\beta_\ell$ only if $D = I$. This will not be true in general; e.g. if $X_\ell$ is a
set of dummy variables for a factor, this is true only if the number of observations
in each category is equal.
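
A small numerical check makes the point concrete. The sketch below assumes a three-level factor with unequal category counts (5, 3, 2), chosen only for illustration, and takes $\beta^* = D V^T \beta$ so that $X_\ell \beta = U \beta^*$: the fitted values agree, but the penalized norm does not.

```python
import numpy as np

# Dummy variables for a 3-level factor with unequal category counts (5, 3, 2).
labels = np.repeat([0, 1, 2], [5, 3, 2])
X_l = np.eye(3)[labels]                     # 10 x 3 indicator matrix

U, d, Vt = np.linalg.svd(X_l, full_matrices=False)
beta = np.array([1.0, -2.0, 0.5])
beta_star = d * (Vt @ beta)                 # coefficients on the orthonormal basis U

same_fit = np.allclose(X_l @ beta, U @ beta_star)
norm_beta = np.linalg.norm(beta)            # what the original problem penalizes
norm_beta_star = np.linalg.norm(beta_star)  # what the orthonormalized problem penalizes
```

Because the two norms differ, a group lasso fit on `U` penalizes a different quantity than a fit on `X_l`, so the two problems generally have different solutions.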

Hence an alternative approach is needed. In the non-orthonormal case, we can
think of equation (5) as a ridge regression, with the ridge parameter depending
on $\|\hat\beta_\ell\|$. A complicated scalar equation can be derived for $\|\hat\beta_\ell\|$ from (5); then
substituting into the right-hand side of (5) gives the solution. However this is not
a good approach numerically, as it can involve dividing by the norm of a vector 
that is very close to zero. It is also not guaranteed to converge. In the next section 
we provide a better solution to this problem, and to the sparse group lasso. 



3 Computation for the sparse group lasso 

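The criterion (2) is separable so that block coordinate descent can be used for its
optimization. Therefore we focus on just one group $\ell$, and denote the predictors
by $X_\ell = Z = (Z_1, Z_2, \ldots, Z_k)$, the coefficients by $\beta_\ell = \theta = (\theta_1, \theta_2, \ldots, \theta_k)$ and the
residual by $r = y - \sum_{m \neq \ell} X_m \beta_m$. The subgradient equations are

$$-Z_j^T \Big( r - \sum_{m=1}^{k} Z_m \theta_m \Big) + \lambda_1 s_j + \lambda_2 t_j = 0 \qquad (6)$$

for $j = 1, 2, \ldots, k$, where $s_j = \theta_j / \|\theta\|_2$ if $\theta \neq 0$ and $s$ is a vector satisfying $\|s\|_2 \le 1$
otherwise, and $t_j \in \mathrm{sign}(\theta_j)$, that is, $t_j = \mathrm{sign}(\theta_j)$ if $\theta_j \neq 0$ and $t_j \in [-1, 1]$ if
$\theta_j = 0$. Letting $a = X_\ell^T r$, a necessary and sufficient condition for $\theta$ to be zero
is that the system of equations

$$a_j = \lambda_1 s_j + \lambda_2 t_j \qquad (7)$$

have a solution with $\|s\|_2 \le 1$ and $t_j \in [-1, 1]$. We can determine this by
minimizing

$$J(t) = \frac{1}{\lambda_1^2} \sum_{j=1}^{k} (a_j - \lambda_2 t_j)^2 = \sum_{j=1}^{k} s_j^2 \qquad (8)$$

with respect to the $t_j \in [-1, 1]$ and then checking if $J(\hat t) \le 1$. The minimizer is
easily seen to be

$$\hat t_j = \begin{cases} a_j / \lambda_2 & \text{if } |a_j / \lambda_2| \le 1, \\ \mathrm{sign}(a_j / \lambda_2) & \text{if } |a_j / \lambda_2| > 1. \end{cases}$$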
Now if $J(\hat t) > 1$, then we must minimize the criterion

$$\frac{1}{2} \sum_{i=1}^{N} \Big( r_i - \sum_{j=1}^{k} Z_{ij} \theta_j \Big)^2 + \lambda_1 \|\theta\|_2 + \lambda_2 \sum_{j=1}^{k} |\theta_j| \qquad (9)$$

This is the sum of a convex differentiable function (first two terms) and a separable
penalty, and hence we can use coordinate descent to obtain the global minimum.
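
The group-level test built from (7) and (8) can be sketched in Python as below. The function name is hypothetical, and the sketch assumes $\lambda_1 > 0$ and $\lambda_2 \ge 0$; the minimizing $t$ is obtained by clipping $a_j / \lambda_2$ to $[-1, 1]$.

```python
import numpy as np

def group_is_zero(Z, r, lam1, lam2):
    """Return True if theta = 0 solves the group subproblem, i.e. J(t_hat) <= 1 in (8).

    Assumes lam1 > 0 and lam2 >= 0.
    """
    a = Z.T @ r
    if lam2 > 0:
        t_hat = np.clip(a / lam2, -1.0, 1.0)   # minimizer of J over [-1, 1]^k
    else:
        t_hat = np.zeros_like(a)               # t drops out of (7) when lam2 = 0
    J = np.sum((a - lam2 * t_hat) ** 2) / lam1 ** 2
    return bool(J <= 1.0)

# Illustration with Z = I, so that a = r.
Z = np.eye(3)
r = np.array([3.0, 0.5, -0.2])
zero_with_large_lam1 = group_is_zero(Z, r, lam1=2.5, lam2=1.0)   # J = 4 / 6.25
zero_with_small_lam1 = group_is_zero(Z, r, lam1=1.5, lam2=1.0)   # J = 4 / 2.25
```

With `lam2 = 0` the test becomes a comparison of the norm of `a` with `lam1`, which is the group lasso condition.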

Here are the details of the coordinate descent procedure. For each $j$ let $r_j =
r - \sum_{m \neq j} Z_m \hat\theta_m$. Then $\hat\theta_j = 0$ if $|Z_j^T r_j| < \lambda_2$. This follows easily by examining the
subgradient equation corresponding to (9). Otherwise, if $|Z_j^T r_j| \ge \lambda_2$, we minimize
(9) by a one-dimensional search over $\theta_j$. We use the optimize function in R,
which is a combination of golden section search and successive parabolic
interpolation.

This leads to the following algorithm: 

Algorithm for the sparse group lasso 

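1. Start with $\hat\beta = \beta_0$.

2. In group $\ell$ define $r_\ell = y - \sum_{m \neq \ell} X_m \hat\beta_m$, $X_\ell = (Z_1, Z_2, \ldots, Z_k)$, $\hat\beta_\ell = (\theta_1, \theta_2, \ldots, \theta_k)$
and $r_j = r_\ell - \sum_{m \neq j} Z_m \theta_m$. Check if $J(\hat t) \le 1$ according to (8) and if so set
$\hat\beta_\ell = 0$. Otherwise for $j = 1, 2, \ldots, k$, if $|Z_j^T r_j| < \lambda_2$ then $\hat\theta_j = 0$; if instead
$|Z_j^T r_j| \ge \lambda_2$ then minimize

$$\frac{1}{2} \sum_{i=1}^{N} \Big( r_{\ell i} - \sum_{j=1}^{k} Z_{ij} \theta_j \Big)^2 + \lambda_1 \|\theta\|_2 + \lambda_2 \sum_{j=1}^{k} |\theta_j| \qquad (10)$$

over $\theta_j$ by a one-dimensional optimization.

3. Iterate step (2) over groups $\ell = 1, 2, \ldots, L$ until convergence.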

If $\lambda_2$ is zero, we instead use condition (4) for the group-level test and we don't need
to check the condition $|Z_j^T r_j| < \lambda_2$. With these modifications, this algorithm also
gives an effective method for solving the group lasso with non-orthonormal model
matrices.
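
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain golden section search stands in for R's optimize, the bracket for each one-dimensional search runs from zero to the coordinate's lasso solution, and a group that passes the group-level test while currently at zero is restarted from a nonzero point (an addition of this sketch, so that coordinate steps do not stall at $\theta = 0$, where $\|\theta\|_2$ is not differentiable). All function and argument names are hypothetical.

```python
import numpy as np

def golden_section(f, lo, hi, n_iter=80):
    """Minimize a unimodal scalar function f on [lo, hi] by golden section search."""
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - inv_phi * (hi - lo), lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(n_iter):
        if fc < fd:                       # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:                             # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def sparse_group_lasso(y, X, groups, lam1, lam2, n_sweeps=200, tol=1e-7):
    """Group-level test, then coordinate descent on criterion (9) within active groups."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for g in groups:
            Z = X[:, g]
            r = y - X @ beta + Z @ beta[g]                 # partial residual for the group
            a = Z.T @ r
            soft = np.sign(a) * np.maximum(np.abs(a) - lam2, 0.0)
            if np.linalg.norm(soft) <= lam1:               # equivalent to J(t_hat) <= 1
                beta[g] = 0.0
                continue
            theta = beta[g].copy()
            if not theta.any():
                # Assumption of this sketch: restart from a descent point, not from zero.
                direction = soft / np.linalg.norm(soft)
                step = (np.linalg.norm(soft) - lam1) / np.sum((Z @ direction) ** 2)
                theta = step * direction
            for j in range(len(g)):
                zj = Z[:, j]
                rj = r - Z @ theta + zj * theta[j]
                aj = zj @ rj
                if abs(aj) < lam2:
                    theta[j] = 0.0
                    continue
                c = np.sum(np.delete(theta, j) ** 2)       # squared norm of the other coordinates
                lasso_sol = np.sign(aj) * (abs(aj) - lam2) / (zj @ zj)
                f = lambda t: (0.5 * np.sum((rj - zj * t) ** 2)
                               + lam1 * np.sqrt(t * t + c) + lam2 * abs(t))
                lo, hi = sorted((0.0, lasso_sol))          # group penalty only shrinks toward 0
                theta[j] = golden_section(f, lo, hi)
            beta[g] = theta
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
X[:, 1] += 0.5 * X[:, 0]                  # correlated columns: a non-orthonormal group
groups = [np.arange(0, 3), np.arange(3, 6)]
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(40)
beta_hat = sparse_group_lasso(y, X, groups, lam1=2.0, lam2=1.0)
```

Calling the same function with `lam2=0.0` gives a sketch of the group lasso for non-orthonormal model matrices, since the group-level test then reduces to comparing the norm of `a` with `lam1`.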

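Note that in the special case where $X_\ell^T X_\ell = I$, with $X_\ell = (Z_1, Z_2, \ldots, Z_k)$, then
it is easy to show that

$$\hat\theta = \Big( 1 - \frac{\lambda_1}{\|S(X_\ell^T r_\ell, \lambda_2)\|_2} \Big)_+ S(X_\ell^T r_\ell, \lambda_2),$$

where $S(a, \lambda_2)_j = \mathrm{sign}(a_j)(|a_j| - \lambda_2)_+$ is the coordinate-wise soft-thresholding operator,
and for $\lambda_2 = 0$ this reduces to the algorithm of Yuan & Lin (2007).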
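
For the orthonormal special case, the group update has a closed form under the $\frac{1}{2}$ scaling of criterion (9): soft-threshold $X_\ell^T r_\ell$ at $\lambda_2$, then shrink the whole vector by $\lambda_1$. The sketch below states that standard result in Python; the function names are hypothetical and orthonormality of `Z` is assumed rather than checked.

```python
import numpy as np

def soft_threshold(a, lam):
    """Coordinate-wise soft-thresholding: sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def orthonormal_group_update(Z, r, lam1, lam2):
    """Closed-form minimizer of (9) when Z.T @ Z = I (assumed, not checked)."""
    s = soft_threshold(Z.T @ r, lam2)
    norm_s = np.linalg.norm(s)
    if norm_s <= lam1:
        return np.zeros_like(s)          # the whole group is set to zero
    return (1.0 - lam1 / norm_s) * s     # group-level shrinkage of the thresholded vector

Z = np.eye(3)
r = np.array([3.0, -0.5, 1.0])
theta_group = orthonormal_group_update(Z, r, lam1=1.0, lam2=0.0)   # group lasso update
theta_sparse = orthonormal_group_update(Z, r, lam1=1.0, lam2=1.0)  # sparse within the group
```

With `lam2=0.0` the soft-thresholding step is the identity and the update is the orthonormal group lasso formula; with `lam2 > 0` individual coordinates can be zero inside a nonzero group.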



4 An example 

We generated n = 200 observations with p = 100 predictors, in ten blocks of ten. 
The second fifty predictors all have coefficients of zero. The numbers of non-zero
coefficients in the first five blocks of 10 are (10, 8, 6, 4, 2, 1) respectively, with
coefficients equal to ±1, the sign chosen at random. The predictors are standard 
Gaussian with correlation 0.2 within a group and zero otherwise. Finally, Gaussian 
noise with standard deviation 4.0 was added to each observation. 
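
A data set of this kind could be generated along the following lines. This is a sketch under assumptions, not the authors' simulation code: the text lists six counts for five blocks, so the sketch uses only the first five, (10, 8, 6, 4, 2); the nonzero coefficients are placed at the start of each block; and the within-group correlation is induced by a shared Gaussian factor. The function name and seed are hypothetical.

```python
import numpy as np

def simulate_example(n=200, n_groups=10, group_size=10, rho=0.2, noise_sd=4.0,
                     nonzero_per_group=(10, 8, 6, 4, 2), seed=0):
    """Simulate grouped Gaussian predictors and a sparse +/-1 coefficient vector."""
    rng = np.random.default_rng(seed)
    p = n_groups * group_size
    # Shared factor + own noise gives unit variance and correlation rho within a group.
    shared = rng.standard_normal((n, n_groups, 1))
    own = rng.standard_normal((n, n_groups, group_size))
    X = (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own).reshape(n, p)
    beta = np.zeros(p)
    for g, k in enumerate(nonzero_per_group):
        start = g * group_size
        # Assumption: the k nonzero coefficients occupy the first k slots of the block.
        beta[start:start + k] = rng.choice([-1.0, 1.0], size=k)
    y = X @ beta + noise_sd * rng.standard_normal(n)
    groups = [np.arange(g * group_size, (g + 1) * group_size) for g in range(n_groups)]
    return y, X, beta, groups

y, X, beta, groups = simulate_example()
```

The returned `groups` list uses the same index-array encoding that a group-wise fitting routine would need.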

Figure 2 shows the signs of the estimated coefficients from the lasso, group lasso
and sparse group lasso, using a well chosen tuning parameter for each method (we
set $\lambda_1 = \lambda_2$ for the sparse group lasso). The corresponding misclassification rates
for the groups and individual features are shown in Figure 3. We see that the
sparse group lasso strikes an effective compromise between the lasso and group
lasso, yielding sparseness at the group and individual predictor levels.

Figure 2: Results for the simulated example. True coefficients are indicated by the open
triangles while the filled green circles indicate the sign of the estimated coefficients from
each method.




Figure 3: Results for the simulated example. The top panel shows the number of groups 
that are misclassified as the regularization parameter is varied. A misclassified group 
is one with at least one nonzero coefficient whose estimated coefficients are all set to 
zero, or vice versa. The bottom panel shows the number of individual coefficients that 
are misclassified, that is, estimated to be zero when the true coefficient is nonzero or 
vice-versa. 